Exploring the Online Retail Dataset
This workbook was created using the “dataexpks” template:
https://github.com/DublinLearningGroup/dataexpks
1 Introduction
This workbook performs the basic data exploration of the dataset.
2 Load Data
First we load the dataset as well as some support datasets.
## Rows: 1,067,371
## Columns: 9
## $ excel_sheet <chr> "Year 2009-2010", "Year 2009-2010", "Year 2009-2010", "…
## $ Invoice <chr> "489434", "489434", "489434", "489434", "489434", "4894…
## $ StockCode <chr> "85048", "79323P", "79323W", "22041", "21232", "22064",…
## $ Description <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY LIG…
## $ Quantity <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, 18,…
## $ InvoiceDate <dttm> 2009-12-01 07:45:00, 2009-12-01 07:45:00, 2009-12-01 0…
## $ Price <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55, 3…
## $ `Customer ID` <chr> "13085", "13085", "13085", "13085", "13085", "13085", "…
## $ Country <chr> "United Kingdom", "United Kingdom", "United Kingdom", "…
2.1 Perform Quick Data Cleaning
Some of the dates provided in the dataset are in an irregular format.
## Rows: 1,067,371
## Columns: 9
## $ excel_sheet <chr> "Year 2009-2010", "Year 2009-2010", "Year 2009-2010", "Y…
## $ invoice <chr> "489434", "489434", "489434", "489434", "489434", "48943…
## $ stock_code <chr> "85048", "79323P", "79323W", "22041", "21232", "22064", …
## $ description <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY LIGH…
## $ quantity <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, 18, …
## $ invoice_date <dttm> 2009-12-01 07:45:00, 2009-12-01 07:45:00, 2009-12-01 07…
## $ price <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55, 3.…
## $ customer_id <chr> "13085", "13085", "13085", "13085", "13085", "13085", "1…
## $ country <chr> "United Kingdom", "United Kingdom", "United Kingdom", "U…
2.2 Create Derived Variables
We now create derived features useful for modelling. These values are new variables calculated from existing variables in the data.
## Rows: 1,067,371
## Columns: 21
## $ excel_sheet <chr> "Year 2009-2010", "Year 2009-2010", "Year 2009-2010…
## $ invoice_id <chr> "489434", "489434", "489434", "489434", "489434", "…
## $ stock_code <chr> "85048", "79323P", "79323W", "22041", "21232", "220…
## $ description <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY…
## $ quantity <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10,…
## $ invoice_date <date> 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 20…
## $ price <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.5…
## $ customer_id <chr> "13085", "13085", "13085", "13085", "13085", "13085…
## $ country <chr> "United Kingdom", "United Kingdom", "United Kingdom…
## $ stock_code_upr <chr> "85048", "79323P", "79323W", "22041", "21232", "220…
## $ cancellation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
## $ invoice_dttm <dttm> 2009-12-01 07:45:00, 2009-12-01 07:45:00, 2009-12-…
## $ invoice_month <chr> "December", "December", "December", "December", "De…
## $ invoice_dow <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesda…
## $ invoice_dom <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01…
## $ invoice_hour <chr> "07", "07", "07", "07", "07", "07", "07", "07", "07…
## $ invoice_minute <chr> "45", "45", "45", "45", "45", "45", "45", "45", "45…
## $ invoice_woy <chr> "49", "49", "49", "49", "49", "49", "49", "49", "49…
## $ invoice_ym <chr> "200912", "200912", "200912", "200912", "200912", "…
## $ stock_amount <dbl> 83.40, 81.00, 81.00, 100.80, 30.00, 39.60, 30.00, 5…
## $ invoice_monthprop <dbl> 0.04347826, 0.04347826, 0.04347826, 0.04347826, 0.0…
3 Perform Basic Checks on Data
3.1 Check/Remove Duplicated Data
Now that we have data across multiple sheets in the Excel file we need to see if the data is duplicated across the sheets.
data_tbl %>%
group_by(excel_sheet) %>%
summarise(
.groups = "drop",
min_date = min(invoice_date),
max_date = max(invoice_date)
) %>%
print()## # A tibble: 2 x 3
## excel_sheet min_date max_date
## <chr> <date> <date>
## 1 Year 2009-2010 2009-12-01 2010-12-09
## 2 Year 2010-2011 2010-12-01 2011-12-09
There appears to be some overlap for the two sheets for the first 10 days or so of December. It is possible this data is not duplicated though, so we will check that.
data_tbl %>%
group_by(excel_sheet, invoice_date, invoice_id, stock_code) %>%
mutate(row_id = 1:n()) %>%
ungroup() %>%
filter(row_id > 1) %>%
arrange(invoice_date, invoice_id, invoice_dttm, excel_sheet)## # A tibble: 24,019 x 22
## excel_sheet invoice_id stock_code description quantity invoice_date price
## <chr> <chr> <chr> <chr> <dbl> <date> <dbl>
## 1 Year 2009-… 489488 22125 UNION JACK… 1 2009-12-01 5.95
## 2 Year 2009-… 489517 21912 VINTAGE SN… 1 2009-12-01 3.75
## 3 Year 2009-… 489517 22130 PARTY CONE… 6 2009-12-01 0.85
## 4 Year 2009-… 489517 22319 HAIRCLIPS … 12 2009-12-01 0.65
## 5 Year 2009-… 489517 21913 VINTAGE SE… 1 2009-12-01 3.75
## 6 Year 2009-… 489517 21821 GLITTER ST… 1 2009-12-01 3.75
## 7 Year 2009-… 489517 84951A S/4 PISTAC… 1 2009-12-01 2.55
## 8 Year 2009-… 489517 21491 SET OF THR… 1 2009-12-01 1.95
## 9 Year 2009-… 489517 21791 VINTAGE HE… 1 2009-12-01 1.25
## 10 Year 2009-… 489517 21790 VINTAGE SN… 1 2009-12-01 0.85
## # … with 24,009 more rows, and 15 more variables: customer_id <chr>,
## # country <chr>, stock_code_upr <chr>, cancellation <lgl>,
## # invoice_dttm <dttm>, invoice_month <chr>, invoice_dow <chr>,
## # invoice_dom <chr>, invoice_hour <chr>, invoice_minute <chr>,
## # invoice_woy <chr>, invoice_ym <chr>, stock_amount <dbl>,
## # invoice_monthprop <dbl>, row_id <int>
It looks like there are about 24,000 rows of data that are duplicated, so we can remove them.
data_tbl <- data_tbl %>%
group_by(invoice_date, invoice_id, stock_code) %>%
mutate(row_id = 1:n()) %>%
ungroup() %>%
filter(row_id == 1) %>%
select(-row_id)
data_tbl %>% glimpse()## Rows: 1,021,424
## Columns: 21
## $ excel_sheet <chr> "Year 2009-2010", "Year 2009-2010", "Year 2009-2010", "Year 2009-2010", "Year 2009-2010", "Year 2009-2010", "Year 2009-2010", "Year 200…
## $ invoice_id <chr> "489434", "489434", "489434", "489434", "489434", "489434", "489434", "489434", "489435", "489435", "489435", "489435", "489436", "4894…
## $ stock_code <chr> "85048", "79323P", "79323W", "22041", "21232", "22064", "21871", "21523", "22350", "22349", "22195", "22353", "48173C", "21755", "21754…
## $ description <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY LIGHTS", "WHITE CHERRY LIGHTS", "RECORD FRAME 7\" SINGLE SIZE", "STRAWBERRY CERAMIC…
## $ quantity <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, 18, 3, 16, 3, 12, 12, 12, 16, 4, 2, 12, 12, 12, 3, 6, 8, 8, 24, 6, 6, 12, 2, 1, 2, …
## $ invoice_date <date> 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 20…
## $ price <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55, 3.75, 1.65, 2.55, 5.95, 5.45, 5.95, 1.69, 6.95, 1.45, 1.65, 1.65, 3.39, 3.75, 8.5…
## $ customer_id <chr> "13085", "13085", "13085", "13085", "13085", "13085", "13085", "13085", "13085", "13085", "13085", "13085", "13078", "13078", "13078", …
## $ country <chr> "United Kingdom", "United Kingdom", "United Kingdom", "United Kingdom", "United Kingdom", "United Kingdom", "United Kingdom", "United K…
## $ stock_code_upr <chr> "85048", "79323P", "79323W", "22041", "21232", "22064", "21871", "21523", "22350", "22349", "22195", "22353", "48173C", "21755", "21754…
## $ cancellation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
## $ invoice_dttm <dttm> 2009-12-01 07:45:00, 2009-12-01 07:45:00, 2009-12-01 07:45:00, 2009-12-01 07:45:00, 2009-12-01 07:45:00, 2009-12-01 07:45:00, 2009-12-…
## $ invoice_month <chr> "December", "December", "December", "December", "December", "December", "December", "December", "December", "December", "December", "De…
## $ invoice_dow <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tu…
## $ invoice_dom <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01…
## $ invoice_hour <chr> "07", "07", "07", "07", "07", "07", "07", "07", "07", "07", "07", "07", "09", "09", "09", "09", "09", "09", "09", "09", "09", "09", "09…
## $ invoice_minute <chr> "45", "45", "45", "45", "45", "45", "45", "45", "45", "45", "45", "45", "05", "05", "05", "05", "05", "05", "05", "05", "05", "05", "05…
## $ invoice_woy <chr> "49", "49", "49", "49", "49", "49", "49", "49", "49", "49", "49", "49", "49", "49", "49", "49", "49", "49", "49", "49", "49", "49", "49…
## $ invoice_ym <chr> "200912", "200912", "200912", "200912", "200912", "200912", "200912", "200912", "200912", "200912", "200912", "200912", "200912", "2009…
## $ stock_amount <dbl> 83.40, 81.00, 81.00, 100.80, 30.00, 39.60, 30.00, 59.50, 30.60, 45.00, 39.60, 30.60, 59.50, 98.10, 17.85, 27.04, 20.85, 17.40, 19.80, 1…
## $ invoice_monthprop <dbl> 0.04347826, 0.04347826, 0.04347826, 0.04347826, 0.04347826, 0.04347826, 0.04347826, 0.04347826, 0.04347826, 0.04347826, 0.04347826, 0.0…
3.2 Check Dates
We now want to check for missing dates in this dataset.
data_tbl %>%
select(invoice_date, invoice_dow) %>%
distinct() %>%
mutate(date_diff = invoice_date - lag(invoice_date)) %>%
filter(date_diff > 1) %>%
datatable()It appears that there are no missing dates in this dataset, though we may need to account for the fact that the business is not open all the time as there are gaps in dates for one day every weekend, as well as being closed over the Easter and Christmas holidays.
For our time series analysis on a daily basis we may be able to ignore this wrinkle in our dataset.
3.3 Check Missing Values
Before we do anything with the data, we first check for missing values in the dataset. In some cases, missing data is coded by a special character rather than as a blank, so we first correct for this.
With missing data properly encoded, we now visualise the missing data in a number of different ways.
3.3.1 Univariate Missing Data
We first examine a simple univariate count of all the missing data:
row_count <- data_tbl %>% nrow()
missing_univariate_tbl <- data_tbl %>%
summarise_all(list(~sum(are_na(.)))) %>%
gather("variable", "missing_count") %>%
mutate(missing_prop = missing_count / row_count)
ggplot(missing_univariate_tbl) +
geom_bar(aes(x = fct_reorder(variable, -missing_prop),
weight = missing_prop)) +
xlab("Variable") +
ylab("Missing Value Proportion") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))We remove all variables where all of the entries are missing
remove_vars <- missing_univariate_tbl %>%
filter(missing_count == row_count) %>%
pull(variable)
lessmiss_data_tbl <- data_tbl %>%
select(-one_of(remove_vars))With these columns removed, we repeat the exercise.
missing_univariate_tbl <- lessmiss_data_tbl %>%
summarise_all(list(~sum(are_na(.)))) %>%
gather("variable", "missing_count") %>%
mutate(missing_prop = missing_count / row_count)
ggplot(missing_univariate_tbl) +
geom_bar(aes(x = fct_reorder(variable, -missing_prop),
weight = missing_prop)) +
xlab("Variable") +
ylab("Missing Value Proportion") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))To reduce the scale of this plot, we look at the top twenty missing data counts.
missing_univariate_top_tbl <- missing_univariate_tbl %>%
arrange(desc(missing_count)) %>%
top_n(n = 50, wt = missing_count)
ggplot(missing_univariate_top_tbl) +
geom_bar(aes(x = fct_reorder(variable, -missing_prop),
weight = missing_prop)) +
xlab("Variable") +
ylab("Missing Value Proportion") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))3.3.2 Multivariate Missing Data
It is useful to get an idea of what combinations of variables tend to have variables with missing values simultaneously, so to construct a visualisation for this we create a count of all the times given combinations of variables have missing values, producing a heat map for these combination counts.
row_count <- rawdata_tbl %>% nrow()
count_nas <- ~ .x %>% are_na() %>% vec_cast(integer())
missing_plot_tbl <- rawdata_tbl %>%
mutate_all(count_nas) %>%
mutate(label = pmap_chr(., str_c)) %>%
group_by(label) %>%
summarise_all(list(sum)) %>%
arrange(desc(label)) %>%
select(-label) %>%
mutate(label_count = pmap_int(., pmax)) %>%
gather("col", "count", -label_count) %>%
mutate(miss_prop = count / row_count,
group_label = sprintf("%6.4f", round(label_count / row_count, 4))
)
ggplot(missing_plot_tbl) +
geom_tile(aes(x = col, y = group_label, fill = miss_prop), height = 0.8) +
scale_fill_continuous() +
scale_x_discrete(position = "top") +
xlab("Variable") +
ylab("Missing Value Proportion") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))This visualisation takes a little explaining.
Each row represents a combination of variables with simultaneous missing values. For each row in the graphic, the coloured entries show which particular variables are missing in that combination. The proportion of rows with that combination is displayed in both the label for the row and the colouring for the cells in the row.
3.4 Inspect High-level-count Categorical Variables
With the raw data loaded up we now remove obvious unique or near-unique variables that are not amenable to basic exploration and plotting.
coltype_lst <- create_coltype_list(data_tbl)
count_levels <- ~ .x %>% unique() %>% length()
catvar_valuecount_tbl <- data_tbl %>%
summarise_at(coltype_lst$split$discrete, count_levels) %>%
gather("var_name", "level_count") %>%
arrange(-level_count)
print(catvar_valuecount_tbl)## # A tibble: 14 x 2
## var_name level_count
## <chr> <int>
## 1 invoice_id 53628
## 2 customer_id 5943
## 3 description 5656
## 4 stock_code 5304
## 5 stock_code_upr 5131
## 6 invoice_minute 60
## 7 invoice_woy 52
## 8 country 43
## 9 invoice_dom 31
## 10 invoice_ym 25
## 11 invoice_hour 16
## 12 invoice_month 12
## 13 invoice_dow 7
## 14 excel_sheet 2
## Dataset has 1021424 rows
Now that we a table of the counts of all the categorical variables we can automatically exclude unique variables from the exploration, as the level count will match the row count.
unique_vars <- catvar_valuecount_tbl %>%
filter(level_count == row_count) %>%
pull(var_name)
print(unique_vars)## character(0)
Having removed the unique identifier variables from the dataset, we may also wish to exclude categoricals with high level counts also, so we create a vector of those variable names.
highcount_vars <- catvar_valuecount_tbl %>%
filter(level_count >= level_exclusion_threshold,
level_count < row_count) %>%
pull(var_name)
cat(str_c(highcount_vars, collapse = ", "))## invoice_id, customer_id, description, stock_code, stock_code_upr
We now can continue doing some basic exploration of the data. We may also choose to remove some extra columns from the dataset.
### You may want to comment out these next few lines to customise which
### categoricals are kept in the exploration.
drop_vars <- c(highcount_vars)
if (length(drop_vars) > 0) {
explore_data_tbl <- explore_data_tbl %>%
select(-one_of(drop_vars))
cat(str_c(drop_vars, collapse = ", "))
}## invoice_id, customer_id, description, stock_code, stock_code_upr
4 Univariate Data Exploration
Now that we have loaded the data we can prepare it for some basic data exploration. We first exclude the variables that are unique identifiers or similar, and tehen split the remaining variables out into various categories to help with the systematic data exploration.
## $split
## $split$continuous
## [1] "quantity" "price" "stock_amount" "invoice_monthprop"
##
## $split$datetime
## [1] "invoice_date" "invoice_dttm"
##
## $split$discrete
## [1] "excel_sheet" "country" "invoice_month" "invoice_dow" "invoice_dom" "invoice_hour" "invoice_minute" "invoice_woy" "invoice_ym"
##
## $split$logical
## [1] "cancellation"
##
##
## $columns
## excel_sheet quantity invoice_date price country cancellation invoice_dttm invoice_month invoice_dow
## "discrete" "continuous" "datetime" "continuous" "discrete" "logical" "datetime" "discrete" "discrete"
## invoice_dom invoice_hour invoice_minute invoice_woy invoice_ym stock_amount invoice_monthprop
## "discrete" "discrete" "discrete" "discrete" "discrete" "continuous" "continuous"
4.1 Logical Variables
Logical variables only take two values: TRUE or FALSE. It is useful to see missing data as well though, so we also plot the count of those.
logical_vars <- coltype_lst$split$logical %>% sort()
for (plot_varname in logical_vars) {
cat("--\n")
cat(glue("{plot_varname}\n"))
na_count <- explore_data_tbl %>% pull(!! plot_varname) %>% are_na() %>% sum()
explore_plot <- ggplot(explore_data_tbl) +
geom_bar(aes(x = !! sym(plot_varname))) +
xlab(plot_varname) +
ylab("Count") +
scale_y_continuous(labels = label_comma()) +
ggtitle(str_c("Barplot of Counts for Variable: ", plot_varname,
" (", na_count, " missing values)")) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
plot(explore_plot)
}## --
## cancellation
4.2 Numeric Variables
Numeric variables are usually continuous in nature, though we also have integer data.
numeric_vars <- coltype_lst$split$continuous %>% sort()
for (plot_varname in numeric_vars) {
cat("--\n")
cat(glue("{plot_varname}\n"))
plot_var <- explore_data_tbl %>% pull(!! plot_varname)
na_count <- plot_var %>% are_na() %>% sum()
plot_var %>% summary() %>% print()
neg_data_tbl <- explore_data_tbl %>%
filter(!! sym(plot_varname) < 0) %>%
mutate(var_val = abs(!! sym(plot_varname)))
pos_data_tbl <- explore_data_tbl %>%
filter(!! sym(plot_varname) >= 0) %>%
mutate(var_val = abs(!! sym(plot_varname)))
all_plot <- ggplot(explore_data_tbl) +
geom_histogram(aes(x = !! sym(plot_varname)), bins = hist_bins_count) +
geom_vline(xintercept = mean(plot_var, na.rm = TRUE),
colour = "red", size = 1.5) +
geom_vline(xintercept = median(plot_var, na.rm = TRUE),
colour = "green", size = 1.5) +
xlab(plot_varname) +
ylab("Count") +
scale_y_continuous(labels = label_comma()) +
ggtitle(glue("Histogram Plot for Variable: {plot_varname} ({na_count} missing values)"),
subtitle = "(red line is mean, green line is median)")
pos_log_plot <- ggplot(pos_data_tbl) +
geom_histogram(aes(x = var_val), bins = hist_bins_count) +
xlab(plot_varname) +
ylab("Count") +
scale_x_log10(labels = label_comma()) +
scale_y_continuous(labels = label_comma()) +
ggtitle("Positive Values")
neg_log_plot <- ggplot(neg_data_tbl) +
geom_histogram(aes(x = var_val), bins = hist_bins_count) +
xlab(plot_varname) +
ylab("Count") +
scale_x_log10(labels = label_comma()) +
scale_y_continuous(labels = label_comma()) +
ggtitle("Negative Values")
plot_grid(
all_plot,
NULL,
pos_log_plot,
neg_log_plot,
nrow = 2
) %>%
print()
}## --
## invoice_monthprop Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03226 0.28571 0.53333 0.52621 0.76667 1.00000
## --
## price Min. 1st Qu. Median Mean 3rd Qu. Max.
## -53594.36 1.25 2.10 4.62 4.15 38970.00
## --
## quantity Min. 1st Qu. Median Mean 3rd Qu. Max.
## -80995.00 1.00 3.00 10.13 10.00 80995.00
## --
## stock_amount Min. 1st Qu. Median Mean 3rd Qu. Max.
## -168469.60 3.75 9.95 18.37 17.70 168469.60
4.3 Categorical Variables
Categorical variables only have values from a limited, and usually fixed, number of possible values
categorical_vars <- coltype_lst$split$discrete %>% sort()
for (plot_varname in categorical_vars) {
cat("--\n")
cat(str_c(plot_varname, "\n"))
na_count <- explore_data_tbl %>% pull(!! plot_varname) %>% are_na() %>% sum()
plot_tbl <- explore_data_tbl %>%
pull(!! plot_varname) %>%
fct_lump(n = cat_level_count) %>%
fct_count() %>%
mutate(f = fct_relabel(f, str_trunc, width = 15))
explore_plot <- ggplot(plot_tbl) +
geom_bar(aes(x = f, weight = n)) +
xlab(plot_varname) +
ylab("Count") +
scale_y_continuous(labels = label_comma()) +
ggtitle(str_c("Barplot of Counts for Variable: ", plot_varname,
" (", na_count, " missing values)")) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
plot(explore_plot)
}## --
## country
## --
## excel_sheet
## --
## invoice_dom
## --
## invoice_dow
## --
## invoice_hour
## --
## invoice_minute
## --
## invoice_month
## --
## invoice_woy
## --
## invoice_ym
4.4 Date/Time Variables
Date/Time variables represent calendar or time-based data should as time of the day, a date, or a timestamp.
datetime_vars <- coltype_lst$split$datetime %>% sort()
for (plot_varname in datetime_vars) {
cat("--\n")
cat(str_c(plot_varname, "\n"))
plot_var <- explore_data_tbl %>% pull(!! plot_varname)
na_count <- plot_var %>% are_na() %>% sum()
plot_var %>% summary() %>% print()
explore_plot <- ggplot(explore_data_tbl) +
geom_histogram(aes(x = !! sym(plot_varname)),
bins = hist_bins_count) +
xlab(plot_varname) +
ylab("Count") +
scale_y_continuous(labels = label_comma()) +
ggtitle(str_c("Barplot of Dates/Times in Variable: ", plot_varname,
" (", na_count, " missing values)"))
plot(explore_plot)
}## --
## invoice_date
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "2009-12-01" "2010-07-05" "2010-12-09" "2011-01-02" "2011-07-27" "2011-12-09"
## --
## invoice_dttm
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "2009-12-01 07:45:00" "2010-07-05 11:11:00" "2010-12-09 14:09:00" "2011-01-03 13:28:54" "2011-07-27 09:32:00" "2011-12-09 12:50:00"
5 Bivariate Data Exploration
We now move on to looking at bivariate plots of the data set.
5.1 Facet Plots on Variables
A natural way to explore relationships in data is to create univariate visualisations facetted by a categorical value.
5.1.1 Logical Variables
For logical variables we facet on barplots of the levels, comparing TRUE, FALSE and missing data.
logical_vars <- logical_vars[!logical_vars %in% facet_varname] %>% sort()
for (plot_varname in logical_vars) {
cat("--\n")
cat(str_c(plot_varname, "\n"))
plot_tbl <- data_tbl %>% filter(!are_na(!! plot_varname))
explore_plot <- ggplot(plot_tbl) +
geom_bar(aes(x = !! sym(plot_varname))) +
facet_wrap(facet_varname, scales = "free") +
xlab(plot_varname) +
ylab("Count") +
scale_y_continuous(labels = label_comma()) +
ggtitle(str_c(facet_varname, "-Faceted Barplots for Variable: ",
plot_varname)) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
plot(explore_plot)
}## --
## cancellation
5.1.2 Numeric Variables
For numeric variables, we facet on histograms of the data.
for (plot_varname in numeric_vars) {
cat("--\n")
cat(str_c(plot_varname, "\n"))
plot_tbl <- data_tbl %>% filter(!are_na(!! plot_varname))
explore_plot <- ggplot(plot_tbl) +
geom_histogram(aes(x = !! sym(plot_varname)),
bins = hist_bins_count) +
facet_wrap(facet_varname, scales = "free") +
xlab(plot_varname) +
ylab("Count") +
scale_x_continuous(labels = label_comma()) +
scale_y_continuous(labels = label_comma()) +
ggtitle(str_c(facet_varname, "-Faceted Histogram for Variable: ",
plot_varname)) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
print(explore_plot)
}## --
## invoice_monthprop
## --
## price
## --
## quantity
## --
## stock_amount
5.1.3 Categorical Variables
We treat categorical variables like logical variables, faceting the barplots of the different levels of the data.
categorical_vars <- categorical_vars[!categorical_vars %in% facet_varname] %>% sort()
for (plot_varname in categorical_vars) {
cat("--\n")
cat(str_c(plot_varname, "\n"))
plot_tbl <- data_tbl %>%
filter(!are_na(!! plot_varname)) %>%
mutate(
varname_trunc = fct_relabel(!! sym(plot_varname), str_trunc, width = 10)
)
explore_plot <- ggplot(plot_tbl) +
geom_bar(aes(x = varname_trunc)) +
facet_wrap(facet_varname, scales = "free") +
xlab(plot_varname) +
ylab("Count") +
scale_y_continuous(labels = label_comma()) +
ggtitle(str_c(facet_varname, "-Faceted Histogram for Variable: ",
plot_varname)) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
plot(explore_plot)
}## --
## country
## --
## invoice_dom
## --
## invoice_dow
## --
## invoice_hour
## --
## invoice_minute
## --
## invoice_month
## --
## invoice_woy
## --
## invoice_ym
5.1.4 Date/Time Variables
Like the univariate plots, we facet on histograms of the years in the dates.
for (plot_varname in datetime_vars) {
cat("--\n")
cat(str_c(plot_varname, "\n"))
plot_tbl <- data_tbl %>% filter(!are_na(!! plot_varname))
explore_plot <- ggplot(plot_tbl) +
geom_histogram(aes(x = !! sym(plot_varname)),
bins = hist_bins_count) +
facet_wrap(facet_varname, scales = "free") +
xlab(plot_varname) +
ylab("Count") +
scale_y_continuous(labels = label_comma()) +
ggtitle(str_c(facet_varname, "-Faceted Histogram for Variable: ",
plot_varname))
plot(explore_plot)
}## --
## invoice_date
## --
## invoice_dttm
6 Custom Explorations
In this section we perform various data explorations.
6.1 Custom Checks for Data Integrity
We want to check the transaction data for consistency, so we create a table of all distinct
stock_codes_lookup_tbl <- data_tbl %>%
select(stock_code_upr, description) %>%
distinct() %>%
arrange(stock_code_upr, description) %>%
drop_na(description)
stock_codes_lookup_tbl %>% glimpse()## Rows: 6,342
## Columns: 2
## $ stock_code_upr <chr> "10002", "10002R", "10080", "10080", "10109", "10120", "10120", "10123C", "10123G", "10124A", "10124G", "10125", "10133", "10133", "10134"…
## $ description <chr> "INFLATABLE POLITICAL GLOBE", "ROBOT PENCIL SHARPNER", "check", "GROOVY CACTUS INFLATABLE", "BENDY COLOUR PENCILS", "DOGGY RUBBER", "Zebra…
We now take a look at the first 50 rows of this table to get a sense of any
possible duplication of stock_code.
6.2 Explore Aggregate Amounts
We now turn our focus to aggregating the data set in various ways and inspect how these aggregate totals are distributed.
6.2.1 Invoice-Level Amounts
We first aggregate the data at the invoice level, and inspect how those amounts are distributed.
invoice_data_tbl <- data_tbl %>%
group_by(invoice_id) %>%
summarise(
.groups = "drop",
invoice_amount = sum(price * quantity) %>% round(2)
)
invoice_mean <- invoice_data_tbl %>% pull(invoice_amount) %>% mean() %>% round(2)
invoice_median <- invoice_data_tbl %>% pull(invoice_amount) %>% median() %>% round(2)
ggplot(invoice_data_tbl) +
geom_histogram(aes(x = invoice_amount), bins = 50) +
geom_vline(aes(xintercept = invoice_mean), colour = "black") +
geom_vline(aes(xintercept = invoice_median), colour = "red") +
xlab("Invoice Amount") +
ylab("Count") +
scale_x_log10(labels = label_comma()) +
scale_y_continuous(labels = label_comma()) +
ggtitle(
label = "Histogram Plot for Invoice Amount",
subtitle = glue("Mean is {invoice_mean}, Median is {invoice_median}")
)We see there is a broad range of different invoice totals, with mean and median being a few hundred pounds.
6.2.2 Customer-Level Amounts
customer_data_tbl <- data_tbl %>%
group_by(customer_id) %>%
summarise(
.groups = "drop",
customer_spend = sum(price * quantity) %>% round(2)
)
ggplot(customer_data_tbl) +
geom_histogram(aes(x = customer_spend), bins = 50) +
xlab("Customer Spend") +
ylab("Count") +
scale_x_log10(labels = label_comma()) +
scale_y_continuous(labels = label_comma()) +
ggtitle("Histogram Plot for Customer Spend")6.2.3 Stock Level Data
stock_price_counts_tbl <- data_tbl %>%
group_by(stock_code) %>%
summarise(
.groups = "drop",
n_prices = n(),
min_price = min(price),
p25_price = quantile(price, 0.25),
mean_price = mean(price) %>% round(2),
p50_price = median(price),
p75_price = quantile(price, 0.75),
max_price = max(price),
range_price = ((max_price - min_price) / mean_price) %>% round(4)
)
stock_price_counts_tbl %>% datatable()stock_distinct_price_counts_tbl <- data_tbl %>%
select(stock_code, price) %>%
distinct() %>%
group_by(stock_code) %>%
summarise(
.groups = "drop",
n_prices = n(),
min_price = min(price),
p25_price = quantile(price, 0.25),
mean_price = mean(price) %>% round(2),
p50_price = median(price),
p75_price = quantile(price, 0.75),
max_price = max(price),
range_price = (max_price - min_price) / mean_price
)
stock_distinct_price_counts_tbl %>% datatable()6.3 Construct Time-Series / Date-Based Data
Another way to look at this data is to combine all the invoice values by various time period such as daily, weekly and monthly to see how it looks.
As we are going to do a number of aggregations based on various aspects of the date, we construct a function that takes a table of data and adds a number of derived columns based on that date: things like day of week, calendar month and so on.
append_calendar_columns <- function(data_tbl, date_col) {
updated_data_tbl <- data_tbl %>%
mutate(
invoice_date = {{date_col}} %>% as.Date(),
invoice_month = {{date_col}} %>% format("%B"),
invoice_dow = {{date_col}} %>% format("%A"),
invoice_dom = {{date_col}} %>% format("%d"),
invoice_hour = {{date_col}} %>% format("%H"),
invoice_minute = {{date_col}} %>% format("%M"),
invoice_woy = {{date_col}} %>% format("%V"),
invoice_ym = {{date_col}} %>% format("%Y%m"),
.after = {{date_col}}
)
return(updated_data_tbl)
}
data_tbl %>%
select(excel_sheet, invoice_id, invoice_date, stock_amount) %>%
append_calendar_columns(invoice_date) %>%
glimpse()## Rows: 1,021,424
## Columns: 11
## $ excel_sheet <chr> "Year 2009-2010", "Year 2009-2010", "Year 2009-2010", "Year 2009-2010", "Year 2009-2010", "Year 2009-2010", "Year 2009-2010", "Year 2009-2…
## $ invoice_id <chr> "489434", "489434", "489434", "489434", "489434", "489434", "489434", "489434", "489435", "489435", "489435", "489435", "489436", "489436"…
## $ invoice_date <date> 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-…
## $ invoice_month <chr> "December", "December", "December", "December", "December", "December", "December", "December", "December", "December", "December", "Decem…
## $ invoice_dow <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesd…
## $ invoice_dom <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", …
## $ invoice_hour <chr> "00", "00", "00", "00", "00", "00", "00", "00", "00", "00", "00", "00", "00", "00", "00", "00", "00", "00", "00", "00", "00", "00", "00", …
## $ invoice_minute <chr> "00", "00", "00", "00", "00", "00", "00", "00", "00", "00", "00", "00", "00", "00", "00", "00", "00", "00", "00", "00", "00", "00", "00", …
## $ invoice_woy <chr> "49", "49", "49", "49", "49", "49", "49", "49", "49", "49", "49", "49", "49", "49", "49", "49", "49", "49", "49", "49", "49", "49", "49", …
## $ invoice_ym <chr> "200912", "200912", "200912", "200912", "200912", "200912", "200912", "200912", "200912", "200912", "200912", "200912", "200912", "200912"…
## $ stock_amount <dbl> 83.40, 81.00, 81.00, 100.80, 30.00, 39.60, 30.00, 59.50, 30.60, 45.00, 39.60, 30.60, 59.50, 98.10, 17.85, 27.04, 20.85, 17.40, 19.80, 19.8…
6.3.1 Create Univariate Time-Series of Amounts
ts_data_tbl <- data_tbl %>%
mutate(
ts_week = format(invoice_dttm, "%Y-%U"),
ts_month = format(invoice_dttm, "%Y-%m")
)
ts_daily_tbl <- ts_data_tbl %>%
group_by(label = invoice_date %>% format("%Y-%m-%d")) %>%
summarise(
.groups = "drop",
period_date = min(invoice_date),
total_spend = sum(price * quantity) %>% round(2)
)
ggplot(ts_daily_tbl) +
geom_line(aes(x = period_date, y = total_spend)) +
expand_limits(y = 0) +
scale_y_continuous(labels = label_comma()) +
xlab("Date") +
ylab("Total Spend") +
ggtitle("Lineplot of Total Spend by Day")ts_weekly_tbl <- ts_data_tbl %>%
group_by(label = ts_week) %>%
summarise(
.groups = "drop",
period_date = min(invoice_date),
total_spend = sum(price * quantity)
)
ggplot(ts_weekly_tbl) +
geom_line(aes(x = period_date, y = total_spend)) +
expand_limits(y = 0) +
scale_y_continuous(labels = label_comma()) +
xlab("Date") +
ylab("Total Spend") +
ggtitle("Lineplot of Total Spend by Week")ts_monthly_tbl <- ts_data_tbl %>%
group_by(label = ts_month) %>%
summarise(
.groups = "drop",
period_date = min(invoice_date),
total_spend = sum(price * quantity) %>% round(2)
)
ggplot(ts_monthly_tbl) +
geom_line(aes(x = period_date, y = total_spend)) +
expand_limits(y = 0) +
scale_y_continuous(labels = label_comma()) +
xlab("Date") +
ylab("Total Spend") +
ggtitle("Lineplot of Total Spend by Month")To avoid dealing with multiple files for the time series, we combine them into a single object.
6.3.2 Calendar-Based Boxplots
We have aggregated our data across time periods, but it is also worth looking at both transaction-level and invoice-level amount over time.
ggplot(data_tbl) +
geom_boxplot(aes(x = invoice_woy, y = stock_amount, group = invoice_woy)) +
scale_y_log10(labels = label_comma()) +
xlab("Week of Year") +
ylab("Transaction Amount") +
ggtitle("Boxplot of Transaction Sizes by Week of Year") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))plot_tbl <- data_tbl %>%
group_by(invoice_woy, invoice_id) %>%
summarise(
.groups = "drop",
invoice_amount = sum(stock_amount) %>% round(2)
)
ggplot(plot_tbl) +
geom_boxplot(aes(x = invoice_woy, y = invoice_amount, group = invoice_woy)) +
scale_y_log10(labels = label_comma()) +
xlab("Week of Year") +
ylab("Invoice Amount") +
ggtitle("Boxplot of Invoice Amounts by Week of Year") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))6.4 Check Distribution of Daily Purchases
We now look at individual invoice amounts, and look at how they are distributed on a daily basis.
daily_distribution_tbl <- data_tbl %>%
group_by(invoice_date, invoice_id) %>%
summarise(
.groups = "drop",
total_spend = sum(stock_amount)
)
daily_distribution_tbl %>% glimpse()## Rows: 53,628
## Columns: 3
## $ invoice_date <date> 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-12…
## $ invoice_id <chr> "489434", "489435", "489436", "489437", "489438", "489439", "489440", "489441", "489442", "489443", "489444", "489445", "489446", "489447", …
## $ total_spend <dbl> 505.30, 145.80, 630.33, 310.75, 2286.24, 426.30, 50.40, 344.34, 382.37, 285.06, 141.00, 308.44, 996.10, 130.00, 570.24, 196.10, 485.79, 589.…
7 Miscellaneous Analyses
In this section we try a few different types of analysis.
7.1 Graph Analysis
We can treat this data as a graph, turning both invoices and stock items into nodes on the graph, and create a connection between stock and invoices when the item occurs on the invoice.
This graph will get large, but it is a starting point.
stock_nodes_tbl <- data_tbl %>%
select(stock_code) %>%
distinct() %>%
transmute(node_label = stock_code, node_type = "stock")
invoice_nodes_tbl <- data_tbl %>%
select(invoice_id) %>%
distinct() %>%
transmute(node_label = invoice_id, node_type = "invoice")
nodes_tbl <- list(stock_nodes_tbl, invoice_nodes_tbl) %>%
bind_rows()
edges_tbl <- data_tbl %>%
select(stock_code, invoice_id, quantity, price)
basket_graph <- tbl_graph(
nodes = nodes_tbl,
edges = edges_tbl,
directed = FALSE
)8 R Environment
## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## setting value
## version R version 4.0.2 (2020-06-22)
## os Ubuntu 20.04.1 LTS
## system x86_64, linux-gnu
## ui RStudio
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
## date 2021-02-07
##
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## package * version date lib source
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.0.0)
## backports 1.1.10 2020-09-15 [1] RSPM (R 4.0.2)
## blob 1.2.1 2020-01-20 [1] RSPM (R 4.0.0)
## bookdown 0.20 2020-06-23 [1] RSPM (R 4.0.2)
## broom 0.7.1 2020-10-02 [1] RSPM (R 4.0.2)
## cellranger 1.1.0 2016-07-27 [1] RSPM (R 4.0.0)
## cli 2.1.0 2020-10-12 [1] RSPM (R 4.0.2)
## colorspace 1.4-1 2019-03-18 [1] RSPM (R 4.0.0)
## conflicted * 1.0.4 2019-06-21 [1] RSPM (R 4.0.0)
## cowplot * 1.1.0 2020-09-08 [1] RSPM (R 4.0.2)
## crayon 1.3.4 2017-09-16 [1] RSPM (R 4.0.0)
## crosstalk 1.1.0.1 2020-03-13 [1] RSPM (R 4.0.0)
## DBI 1.1.0 2019-12-15 [1] RSPM (R 4.0.0)
## dbplyr 1.4.4 2020-05-27 [1] RSPM (R 4.0.0)
## digest 0.6.25 2020-02-23 [1] RSPM (R 4.0.0)
## dplyr * 1.0.2 2020-08-18 [1] RSPM (R 4.0.2)
## DT * 0.15 2020-08-05 [1] RSPM (R 4.0.2)
## ellipsis 0.3.1 2020-05-15 [1] RSPM (R 4.0.0)
## evaluate 0.14 2019-05-28 [1] RSPM (R 4.0.0)
## evir * 1.7-4 2018-03-20 [1] RSPM (R 4.0.0)
## fansi 0.4.1 2020-01-08 [1] RSPM (R 4.0.0)
## farver 2.0.3 2020-01-16 [1] RSPM (R 4.0.0)
## forcats * 0.5.0 2020-03-01 [1] RSPM (R 4.0.0)
## fs * 1.5.0 2020-07-31 [1] RSPM (R 4.0.2)
## generics 0.0.2 2018-11-29 [1] RSPM (R 4.0.0)
## ggplot2 * 3.3.2 2020-06-19 [1] RSPM (R 4.0.1)
## glue * 1.4.2 2020-08-27 [1] RSPM (R 4.0.2)
## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.0.0)
## haven 2.3.1 2020-06-01 [1] RSPM (R 4.0.2)
## hms 0.5.3 2020-01-08 [1] RSPM (R 4.0.0)
## htmltools 0.5.0 2020-06-16 [1] RSPM (R 4.0.1)
## htmlwidgets 1.5.2 2020-10-03 [1] RSPM (R 4.0.2)
## httr 1.4.2 2020-07-20 [1] RSPM (R 4.0.2)
## igraph 1.2.6 2020-10-06 [1] RSPM (R 4.0.2)
## jsonlite 1.7.1 2020-09-07 [1] RSPM (R 4.0.2)
## knitr 1.30 2020-09-22 [1] RSPM (R 4.0.2)
## labeling 0.3 2014-08-23 [1] RSPM (R 4.0.0)
## lifecycle 0.2.0 2020-03-06 [1] RSPM (R 4.0.0)
## lubridate * 1.7.9 2020-06-08 [1] RSPM (R 4.0.2)
## magrittr * 1.5 2014-11-22 [1] RSPM (R 4.0.0)
## memoise 1.1.0 2017-04-21 [1] RSPM (R 4.0.0)
## modelr 0.1.8 2020-05-19 [1] RSPM (R 4.0.0)
## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.0.0)
## pillar 1.4.6 2020-07-10 [1] RSPM (R 4.0.2)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.0.0)
## purrr * 0.3.4 2020-04-17 [1] RSPM (R 4.0.0)
## R6 2.4.1 2019-11-12 [1] RSPM (R 4.0.0)
## Rcpp 1.0.5 2020-07-06 [1] RSPM (R 4.0.2)
## readr * 1.4.0 2020-10-05 [1] RSPM (R 4.0.2)
## readxl 1.3.1 2019-03-13 [1] RSPM (R 4.0.2)
## reprex 0.3.0 2019-05-16 [1] RSPM (R 4.0.0)
## rlang * 0.4.8 2020-10-08 [1] RSPM (R 4.0.2)
## rmarkdown 2.4 2020-09-30 [1] RSPM (R 4.0.2)
## rmdformats 0.3.7 2020-03-11 [1] RSPM (R 4.0.0)
## rstudioapi 0.11 2020-02-07 [1] RSPM (R 4.0.0)
## rvest 0.3.6 2020-07-25 [1] RSPM (R 4.0.2)
## scales * 1.1.1 2020-05-11 [1] RSPM (R 4.0.0)
## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.0.0)
## snakecase * 0.11.0 2019-05-25 [1] RSPM (R 4.0.0)
## stringi 1.5.3 2020-09-09 [1] RSPM (R 4.0.2)
## stringr * 1.4.0 2019-02-10 [1] RSPM (R 4.0.0)
## tibble * 3.0.4 2020-10-12 [1] RSPM (R 4.0.2)
## tidygraph * 1.2.0 2020-05-12 [1] RSPM (R 4.0.2)
## tidyr * 1.1.2 2020-08-27 [1] RSPM (R 4.0.2)
## tidyselect 1.1.0 2020-05-11 [1] RSPM (R 4.0.0)
## tidyverse * 1.3.0 2019-11-21 [1] RSPM (R 4.0.0)
## utf8 1.1.4 2018-05-24 [1] RSPM (R 4.0.0)
## vctrs * 0.3.4 2020-08-29 [1] RSPM (R 4.0.2)
## withr 2.3.0 2020-09-22 [1] RSPM (R 4.0.2)
## xfun 0.18 2020-09-29 [1] RSPM (R 4.0.2)
## xml2 1.3.2 2020-04-23 [1] RSPM (R 4.0.0)
## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.0.0)
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library